IGNITE-28671 Describe healthy cluster behavior in general tips guide by w3ll1ngt · Pull Request #13130 · apache/ignite

w3ll1ngt · 2026-05-13T20:21:58Z

Thank you for submitting the pull request to Apache Ignite.

In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:

The Contribution Checklist

There is a single JIRA ticket related to the pull request.
The web-link to the pull request is attached to the JIRA ticket.
The JIRA ticket has the Patch Available state.
The pull request body describes changes that have been made.
The description explains WHAT and WHY was made instead of HOW.
The pull request title is treated as the final commit message.
The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
A reviewer has been mentioned through the JIRA comments
(see the Maintainers list)
The pull request has been checked by the Teamcity Bot and
the green visa attached to the JIRA ticket (see tab PR Check at TC.Bot - Instance 1 or TC.Bot - Instance 2)

Notes

If you need any help, please email dev@ignite.apache.org or ask for advice in the #ignite channel on http://asf.slack.com.

zstan · 2026-05-14T05:34:25Z


+== What healthy cluster behavior looks like
+
+A healthy Ignite cluster is not defined by a single latency, CPU, or memory number. In practice, it is a cluster whose topology is stable, whose cluster state and baseline match the intended deployment, whose partitions are not lost or divergent, whose rebalancing and checkpointing complete in bounded time, and whose execution queues and memory pools return to a steady level after short-lived spikes. Ignite exposes these signals through built-in metrics, system views, and the control script rather than through a single aggregate health score.


"A healthy Ignite cluster is not defined by a single latency, CPU, or memory number." - this is a very strange statement.

w3ll1ngt · 2026-05-18

Yeah, thanks. It does look weird at second glance, I agree. I intended to say something like: "a healthy cluster cannot be defined by a few simple memory metrics (numbers) alone, but rather by the behavior of the whole complex system", and so on.
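
For illustration, the "return to a steady level after short-lived spikes" criterion from the paragraph under review could be checked roughly like this. It is a minimal sketch in plain Java, not Ignite API: `SteadyStateCheck`, `returnsToSteadyLevel`, and the threshold parameters are hypothetical names, and the samples are assumed to come from whatever metric exporter is already in use (queue sizes, checkpoint durations, and so on).

```java
/** Hypothetical helper, not part of Ignite: checks that spikes in a sampled metric are short-lived. */
final class SteadyStateCheck {
    private SteadyStateCheck() {
        // Static helper only.
    }

    /**
     * Returns true when the series never stays above steadyLimit for more than
     * maxSpikeSamples consecutive samples, i.e. every spike is followed by a
     * return to the steady level within the allowed window.
     */
    static boolean returnsToSteadyLevel(long[] samples, long steadyLimit, int maxSpikeSamples) {
        int run = 0; // Length of the current run of samples above the limit.

        for (long sample : samples) {
            run = sample > steadyLimit ? run + 1 : 0;

            if (run > maxSpikeSamples)
                return false; // The value stayed elevated for too long.
        }

        return true;
    }
}
```

With a limit of 10 and at most 2 elevated samples allowed, the series `1, 2, 50, 60, 3, 2` passes, while `1, 50, 60, 70, 2` fails because the value stays elevated for three samples in a row. The right limit and window depend on the workload and are not something the guide prescribes.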

zstan · 2026-05-14T05:35:26Z

+
+A healthy Ignite cluster is not defined by a single latency, CPU, or memory number. In practice, it is a cluster whose topology is stable, whose cluster state and baseline match the intended deployment, whose partitions are not lost or divergent, whose rebalancing and checkpointing complete in bounded time, and whose execution queues and memory pools return to a steady level after short-lived spikes. Ignite exposes these signals through built-in metrics, system views, and the control script rather than through a single aggregate health score.
+
+When checking whether a cluster is healthy, start with topology and cluster state. The cluster should be in the expected state, usually ACTIVE, and the number of server and client nodes should be stable. If native persistence is enabled, the baseline should also be in the expected shape: for a stable deployment, the nodes that are expected to be online should appear online both in baseline-related metrics and in the SYS.BASELINE_NODES system view. Frequent unexpected topology changes are not normal and should be treated as a sign of node instability or network problems.


"usually ACTIVE" - this needs a cross-link to ACTIVE.

w3ll1ngt · 2026-05-18

Thank you, links added.
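
As a rough illustration of the "start with topology and cluster state" advice in this hunk, the check could be expressed over periodically collected samples. This is a sketch under stated assumptions: `TopologySample` and `TopologyCheck` are hypothetical classes, not Ignite API, and the samples are assumed to be filled in from the cluster's own metrics or system views by the monitoring side.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical snapshot of what a monitoring job recorded at one point in time. */
final class TopologySample {
    final String clusterState;
    final int serverNodes;
    final int clientNodes;

    TopologySample(String clusterState, int serverNodes, int clientNodes) {
        this.clusterState = clusterState;
        this.serverNodes = serverNodes;
        this.clientNodes = clientNodes;
    }
}

/** Hypothetical helper, not part of Ignite: turns a window of samples into human-readable findings. */
final class TopologyCheck {
    private TopologyCheck() {
        // Static helper only.
    }

    static List<String> findings(List<TopologySample> window, String expectedState,
        int expectedServers, int maxChanges) {
        List<String> out = new ArrayList<>();

        if (window.isEmpty()) {
            out.add("no samples collected");
            return out;
        }

        // The latest sample must match the intended deployment.
        TopologySample last = window.get(window.size() - 1);

        if (!expectedState.equals(last.clusterState))
            out.add("cluster state is " + last.clusterState + ", expected " + expectedState);

        if (last.serverNodes != expectedServers)
            out.add("server nodes: " + last.serverNodes + ", expected " + expectedServers);

        // Count how often the node counts changed between consecutive samples.
        int changes = 0;

        for (int i = 1; i < window.size(); i++) {
            TopologySample prev = window.get(i - 1);
            TopologySample cur = window.get(i);

            if (prev.serverNodes != cur.serverNodes || prev.clientNodes != cur.clientNodes)
                changes++;
        }

        if (changes > maxChanges)
            out.add("topology changed " + changes + " times in the window");

        return out;
    }
}
```

An empty result means the window looks stable. Any finding is only a prompt to look at node logs and the network, in line with the paragraph's point that frequent unexpected topology changes are a symptom rather than a root cause.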

zstan · 2026-05-14T05:36:37Z

+
+Checkpointing and transactions should also remain bounded. Checkpoint activity can slow the cluster down, so LastCheckpointDuration should be monitored together with dirty pages and disk behavior. Transactions and queries can legitimately take longer during bursts, but healthy steady-state behavior means that lock-holding transactions, long-running transactions, and long-running SQL queries do not accumulate over time. If long transactions repeatedly block partition map exchange, use transaction timeout settings such as TxTimeoutOnPartitionMapExchange and investigate the application path that keeps transactions open.
+
+Finally, check the underlying JVM and critical workers. Ignite treats IgniteOutOfMemoryException, OutOfMemoryError, system worker termination, system worker hangs, and cluster node segmentation as critical failures. A healthy cluster should not emit blocked system-critical worker messages, and JVM resource pools should stay comfortably below exhaustion. In practice, monitor heap usage, direct buffer usage, and open file descriptors continuously, because all three are finite pools and approaching their limits usually means the node is already close to a failure condition rather than merely under benign load.


"as critical failures" - and what does that mean?

w3ll1ngt · 2026-05-18

I completely reworked this phrase. Previously, I meant critical failures as something that triggers the FailureHandler. Thank you.
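
The three JVM-level pools named in this hunk (heap, direct buffers, open file descriptors) can be read with standard JDK management beans alone. The sketch below assumes a HotSpot-style JDK on a Unix-like OS for the file descriptor part; `JvmPoolProbe` is a hypothetical class name, it does not use any Ignite API, and it does not replace Ignite's own metrics.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical probe built only on JDK management beans. */
final class JvmPoolProbe {
    private JvmPoolProbe() {
        // Static helper only.
    }

    /** Used heap as a fraction of the maximum heap, or -1 if the maximum is undefined. */
    static double heapUsedRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();

        return max <= 0 ? -1.0 : (double) heap.getUsed() / max;
    }

    /** Estimated bytes held by direct NIO buffers, or -1 if the "direct" pool is not reported. */
    static long directBufferBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName()))
                return pool.getMemoryUsed();
        }

        return -1L;
    }

    /** Open file descriptors as a fraction of the process limit, or -1 if the platform bean is unavailable. */
    static double openFdRatio() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        // The descriptor counters exist only on the Unix-specific extension of the OS bean.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            long max = unix.getMaxFileDescriptorCount();

            return max <= 0 ? -1.0 : (double) unix.getOpenFileDescriptorCount() / max;
        }

        return -1.0;
    }
}
```

Alert thresholds on these ratios are a deployment decision. The guide text only says the pools should stay comfortably below exhaustion, so any specific cut-off would be a local assumption.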

sonarqubecloud · 2026-05-14T05:41:50Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

IGNITE-28671 Describe healthy cluster behavior in general tips guide

8acce78

zstan reviewed May 14, 2026


w3ll1ngt and others added 2 commits May 14, 2026 22:53

fix review, add cross links, reduce cdc section, add minors

ed59869

IGNITE-28671 Simplify adoc links

f3e3216

w3ll1ngt wants to merge 3 commits into apache:master from w3ll1ngt:ignite-28671


2 participants


